The following scatter plot reflects calculated risk of patient’s future cost based on the CMS-HCC algorithm. The patient data is taken form one of CMS’ Public Use Files and are completely identified.
The following code is used to merge the datasets(one for each year). I have outputted the summary. The risk algorithm was run on 114,000 synthetic patient records.
result08 <- read.csv('../JupyterDemo/data/_2008Results.csv')
result09 <- read.csv('../JupyterDemo/data/_2009Results.csv')
result10 <- read.csv('../JupyterDemo/data/_2009Results.csv')
result1 <- merge(result08[,c('HICNO','SCORE_INSTITUTIONAL')], result09[,c('HICNO','SCORE_INSTITUTIONAL')], by='HICNO', suffixes = c('08','09'))
#result08[,c('HICNO','SCORE_INSTITUTIONAL')]
#,'SCORE_INSTITUTIONAL']
patResults <- merge(result1, result10[,c('HICNO','SCORE_INSTITUTIONAL')], by='HICNO', suffixes = c('a','10') )
rm(result08, result09, result10, result1)
summary(patResults)
## HICNO SCORE_INSTITUTIONAL08 SCORE_INSTITUTIONAL09
## 00013D2EFD8E45D1: 1 Min. : 0.000 Min. : 0.000
## 00016F745862898F: 1 1st Qu.: 0.000 1st Qu.: 0.000
## 0001FDD721E223DC: 1 Median : 0.928 Median : 1.353
## 00021CA6FF03E670: 1 Mean : 1.541 Mean : 1.828
## 00024B3D2352D2D0: 1 3rd Qu.: 2.366 3rd Qu.: 2.885
## 0002DAE1C81CC70D: 1 Max. :14.838 Max. :13.809
## (Other) :114532
## SCORE_INSTITUTIONAL
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 1.353
## Mean : 1.828
## 3rd Qu.: 2.885
## Max. :13.809
##
r08 <- hist(patResults[,'SCORE_INSTITUTIONAL08'], breaks=30, plot=FALSE)
breaks08 <- r08$breaks
mids08 <- r08$mids
counts08 <- r08$counts
r <- hist(patResults[,'SCORE_INSTITUTIONAL09'], breaks=breaks08, plot=FALSE)
mids09 <- r$mids
counts09 <- r$counts
r <- hist(patResults[,'SCORE_INSTITUTIONAL'], breaks=breaks08, plot=FALSE)
mids10 <- r$mids
counts10 <- r$counts
histdata <- data.frame(mids08, counts08, mids09, counts09, mids10, counts10)
For the plot, I chose a histogram. I had assumed I would make it myself by placing bars at the midpoints outputted by the hist function. I then grouped the bars one for each year to show the changes.
As shown, patients in 2010 were overall much riskier than in their previous years. This was not the most efficient way to create the plot. See below
p <- plot_ly(histdata, x = ~mids08, y = ~counts08, type = 'bar', name = '2008 Risk Buckets', marker = list(color = '#ffea00')) %>%
add_trace(y = ~counts09, name = '2009 Risk Buckets', marker = list(color = '#00ffea')) %>%
add_trace(y = ~counts10, name = '2010 Risk Buckets', marker = list(color = '#ea00ff')) %>%
layout(yaxis = list(title = 'Number of Patients'), barmode = 'group', xaxis= list(title = 'Patient Risk Scores'))
p
## Warning in arrange_impl(.data, dots): '.Random.seed' is not an integer
## vector but of type 'NULL', so ignored
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
I later found that plot.ly has a built in Histogram function. (though not well documented in the R API). Here is the same data plotted as a histogram overlay. Same as the other chart, 2008 has more patients in the most numerous low-risk buckets with 2010 being a year when these patients produce more cost and healthcare risk.
I also applied a log scale to the Y-axis makes for a better distribution though it is deceiving.
q <- plot_ly(alpha=.5) %>%
add_histogram(x = patResults[,'SCORE_INSTITUTIONAL08'], name= '2008 Risk Scores', autobinx=FALSE, xbins=list(start=0, end=9, size=.4)) %>%
add_histogram(x = patResults[,'SCORE_INSTITUTIONAL09'], name= '2009 Risk Scores', autobinx=FALSE, xbins=list(start=0, end=9, size=.4)) %>%
add_histogram(x = patResults[,'SCORE_INSTITUTIONAL'], name='2010 Risk Scores', autobinx=FALSE, xbins=list(start=0, end=9, size=.4)) %>%
layout(barmode = "overlay", yaxis = list(type='log', title="# of Patients Log Scale "),bargap=0.10)
q